Project 2: Clustering

MyUH: 1867424

Dosbol Aliev

Background Info

Write a “problem statement” and an introductory paragraph that clearly explains your goals. It
should include at least the following info:
● Describe your dataset, why you picked it, and write a small paragraph discussing your goal
with your dataset, what models you can use to analyze it, and why.

The goal of this project is to find the best way to characterize the variety of consumers that a wholesale distributor deals with. Since this is a clustering project, I will use K-Means Clustering and Agglomerative Hierarchical Clustering and compare their results.

Import Data

There are a total of 440 observations and 8 attributes in the dataset. As we can see, every feature has 440 non-null values, which means there are no missing values.
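A quick way to verify the shape and missing-value counts is shown below. This is a sketch: the commented-out file name for the UCI Wholesale customers CSV is an assumption, and the tiny inline frame only illustrates the schema.

```python
import pandas as pd

# Hypothetical file name -- adjust to the actual CSV used in this project:
# df = pd.read_csv("Wholesale customers data.csv")

# For illustration, a tiny frame with the same eight columns:
df = pd.DataFrame({
    "Channel": [1, 2, 1],
    "Region": [3, 3, 1],
    "Fresh": [12669, 7057, 6353],
    "Milk": [9656, 9810, 8808],
    "Grocery": [7561, 9568, 7684],
    "Frozen": [214, 1762, 2405],
    "Detergents_Paper": [2674, 3293, 3516],
    "Delicassen": [1338, 1776, 7844],
})

df.info()                       # dtype and non-null count per column
print(df.isnull().sum().sum())  # 0 -> no missing values anywhere
```

On the real file, `df.info()` would report 440 non-null entries in every column.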

Printing the number of unique values in each attribute:

CONTINUOUS FEATURES

1 Fresh

2 Milk

3 Grocery

4 Frozen

5 Detergents_Paper

6 Delicassen

CATEGORICAL FEATURES

1 Channel

2 Region
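The continuous/categorical split above can be read off the unique-value counts via `nunique()`. A sketch on illustrative rows with the dataset's schema (the cutoff of 4 is an arbitrary choice for this toy frame; in the full data, Channel has 2 unique values and Region has 3):

```python
import pandas as pd

# Illustrative rows, not the real file:
df = pd.DataFrame({
    "Channel": [1, 2, 1, 2],
    "Region": [3, 3, 1, 2],
    "Fresh": [12669, 7057, 6353, 13265],
    "Milk": [9656, 9810, 8808, 1196],
})

counts = df.nunique()
# Columns with only a handful of unique values behave as categorical;
# the high-cardinality spending columns are continuous.
categorical = counts[counts < 4].index.tolist()
print(categorical)  # ['Channel', 'Region']
```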

Data Exploration & Visualization

In this plot, we have three different cities. The median consumption is 3.6k for Los Angeles residents, 3.7k for Houston, and 2.3k for New York.

The above heatmap shows the correlation between the variables on each axis. We observe severe collinearity between Milk & Detergents_Paper, Grocery & Detergents_Paper, and Milk & Grocery, since the correlations between these pairs are above 0.7 and even above 0.9.
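These collinear pairs can be confirmed numerically with `DataFrame.corr()`. A sketch on synthetic data built to mimic the Milk/Grocery/Detergents_Paper relationship (passing `corr` to `seaborn.heatmap(corr, annot=True)` would render a matrix like the plot above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
milk = rng.gamma(2.0, 3000.0, size=300)
df = pd.DataFrame({
    "Milk": milk,
    # Grocery and Detergents_Paper track Milk closely, as in the real data:
    "Grocery": 0.9 * milk + rng.normal(0, 1500, size=300),
    "Detergents_Paper": 0.4 * milk + rng.normal(0, 800, size=300),
    "Fresh": rng.gamma(2.0, 5000.0, size=300),  # roughly independent
})

corr = df.corr()
print(corr.round(2))  # Milk/Grocery and Milk/Detergents_Paper come out high
```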

As we can see in the diagram above, Los Angeles and Houston have the same number of Retail customers, and New York slightly fewer. For Horeca, all three have almost the same amount.

Above we have the distributions of our features.

This is a pair plot of our features, showing each pairwise relationship.

This scatterplot shows the relationship between Milk and Fresh.

This scatterplot shows the relationship between Milk and Frozen.

Feature Scaling

Questions:

  1. Is it necessary to scale the data? What benefits would it provide?
  2. Which scaler will you use for this data set? Min Max, Standard, Robust, etc.
  3. Are the features or the response variables scaled? *Don’t forget to split your data into test-train splits before scaling!

ANSWERS:

1. Yes, definitely. Features must be scaled before they are fed to clustering methods like K-Means. Because these algorithms build clusters using Euclidean distance, variables on very different scales (e.g., heights in meters and weights in kilograms) should be scaled before the distances are computed.

2. I will use the Standard Scaler, because it usually scales the data well.

3. Yes, the feature variables are scaled.

Since we are doing clustering, which is unsupervised, we do not need a train/test split.
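A minimal sketch of Standard (z-score) scaling with scikit-learn; since clustering is unsupervised, the scaler is fit on the full feature matrix rather than a training split:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two spending columns with very different magnitudes:
X = np.array([
    [12669.0, 214.0],
    [7057.0, 1762.0],
    [6353.0, 2405.0],
    [13265.0, 6404.0],
])

X_scaled = StandardScaler().fit_transform(X)
# Each column now has mean 0 and standard deviation 1, so no single
# feature dominates the Euclidean distances used by K-Means.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```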

Data Preprocessing

Now that you’ve really studied your data, are you going to take any preprocessing measures? For which columns, and why? Define any measures you’ve taken and address why you chose to do so.

Yes, I will take preprocessing measures for the Channel, Region, and Total columns, because they are not helpful when modeling the data. The easiest measure is simply to drop those columns.
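Dropping the three columns is a one-liner. A sketch: `Total` is assumed here to be a sum-of-spending column engineered earlier in the notebook, and the frame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Channel": [1, 2],
    "Region": [3, 3],
    "Fresh": [12669, 7057],
    "Milk": [9656, 9810],
})
df["Total"] = df["Fresh"] + df["Milk"]  # hypothetical engineered column

features = df.drop(columns=["Channel", "Region", "Total"])
print(list(features.columns))  # ['Fresh', 'Milk']
```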

FIRST MODEL: K-Means Clustering

Find optimum value of K
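The usual way to pick K is the elbow method: fit K-Means for a range of K and watch where the inertia (SSE) curve flattens. A sketch on synthetic data with three obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs in 2-D:
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in (0.0, 5.0, 10.0)])

inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances (SSE)

# SSE falls sharply up to the true k=3, then only slowly -- the "elbow".
print({k: round(v, 1) for k, v in inertias.items()})
```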

Build the Model

We can see that the 4th cluster has the maximum number of samples, while the 5th cluster has the minimum.
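Cluster sizes come straight from the fitted labels. A sketch on synthetic data (the real notebook's counts will of course differ):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))  # stand-in for the 6 scaled spending columns

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_, minlength=5)
# sizes[i] is the number of samples assigned to cluster i; pandas users
# can get the same table with pd.Series(km.labels_).value_counts().
print(sizes, "largest:", sizes.argmax(), "smallest:", sizes.argmin())
```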

SECOND MODEL: Agglomerative Clustering

Model

By now, you should have completed any necessary scaling, encoding, preprocessing measures. Next steps would be creating and training your model. Split your data into training and testing sets. Explain what 2 models you’ve chosen and why? Explain and define each model, giving background info, its uses, why it’s beneficial, why you chose it over another model, and compare both models you’ve chosen. Is your approach parametric or non-parametric? Which features were most important? How did both models perform? Make sure you give all relevant info. Show the confusion matrix and classification report. Write a conclusion wrapping up your findings.

Before starting this project, I did some research to find the best clustering machine-learning algorithms. There are said to be more than 100 clustering algorithms; on one website I saw a list of the top 7:

1. Agglomerative Hierarchical Clustering
2. Balanced Iterative Reducing & Clustering
3. EM Clustering
4. Hierarchical Clustering
5. Density-Based Spatial Clustering
6. K-Means Clustering
7. Ordering Points To Identify the Structure of Clustering

From the algorithms above, I chose Agglomerative Hierarchical Clustering and K-Means Clustering. The reason is that we used them in a lab experiment and the professor explained in lecture how they work; in other words, these were the two algorithms familiar to me. We do have to scale the features before modeling, but clustering algorithms do not require a train/test split because they are unsupervised. Both algorithms are non-parametric, and the least important features were "Channel" and "Region." K-Means Clustering gave an SSE of 1058.77, while Agglomerative Clustering produced the three clusters shown in the diagram above; the data is nicely divided into three groups.
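The comparison can be sketched as follows: fit both models with three clusters on the same data. K-Means exposes its SSE as `inertia_`, while Agglomerative Clustering only yields labels, so the two partitions are compared directly. The synthetic, well-separated blobs here are an assumption standing in for the project's scaled features.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2))
               for c in (0.0, 4.0, 8.0)])

agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("K-Means SSE:", round(km.inertia_, 2))  # analogue of the SSE reported above
# Adjusted Rand index is 1.0 when the two partitions agree exactly:
print("agreement:", adjusted_rand_score(agg.labels_, km.labels_))
```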